Search Engine Spiders Lost Without Guidance - Post This Sign!

Released on = September 23, 2006, 10:12 pm

Press Release Author = Cpxclick.com Inc

Industry = Internet & Online

Press Release Summary = The robots.txt file follows the robots exclusion standard, a
voluntary convention that well-behaved web crawlers/robots check to learn which files
and directories you want them to stay OUT of on your site. Not all crawlers/bots
follow the exclusion standard, and those that ignore it will continue crawling your
site anyway. I like to call them "Bad Bots" or trespassers. We block them by IP
exclusion, which is another story entirely. This is a very simple overview of
robots.txt basics for webmasters. For a complete and thorough lesson, visit
Robotstxt.org.

Press Release Body = To see the proper format for a somewhat standard robots.txt
file look directly below. That file should be at the root of the domain because that
is where the crawlers expect it to be, not in some secondary directory.

Below is the proper format for a robots.txt file ----->

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: Teoma
Crawl-delay: 10

User-agent: Slurp
Crawl-delay: 10

User-agent: aipbot
Disallow: /

User-agent: BecomeBot
Disallow: /

User-agent: psbot
Disallow: /

--------> End of robots.txt file

This tiny text file is saved as a plain text document and ALWAYS with the name
"robots.txt" in the root of your domain.

A quick review of the listed information from the robots.txt file above follows. The
"User-agent: msnbot" section addresses the crawler from MSN, Slurp is from Yahoo and
Teoma is from AskJeeves. The others listed are "Bad" bots that crawl very fast and to
nobody's benefit but their own, so we ask them to stay out entirely. The * asterisk is
a wild card that means "All" crawlers/spiders/bots without a section of their own
should stay out of the group of files or directories listed under it.

The instruction "Disallow: /" tells a bot to stay out entirely, and "Crawl-delay: 10"
is given to the bots that crawled our site too quickly, causing it to bog down and
overuse the server resources. Google crawls more slowly than the others and doesn't
require that instruction, so it is not specifically listed in the above robots.txt
file. The Crawl-delay instruction is only needed on very large sites with hundreds or
thousands of pages. Because Googlebot has no section of its own, the wildcard
asterisk * section is the one that applies to it.
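One way to check how a parser reads these sections is Python's standard
urllib.robotparser module. The sketch below is only an illustration: it loads a
shortened copy of the file above and asks what a few crawlers may fetch. The helper
name check and the sample paths are made up for the example, and real crawlers may
interpret the file differently from this one parser.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the robots.txt file shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /group/

User-agent: msnbot
Crawl-delay: 10

User-agent: aipbot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def check(agent, path):
    """Hypothetical helper: return (may_fetch, crawl_delay) for one crawler."""
    return parser.can_fetch(agent, path), parser.crawl_delay(agent)


if __name__ == "__main__":
    for agent, path in [
        ("Googlebot", "/cgi-bin/search.cgi"),  # falls under the * section
        ("Googlebot", "/index.html"),
        ("msnbot", "/index.html"),             # has its own section
        ("aipbot", "/index.html"),             # told to stay out entirely
    ]:
        print(agent, path, check(agent, path))
```

In this parser a crawler named in its own section, such as msnbot, is judged only by
that section, so the Disallow lines under * are not applied to it. Repeating those
lines under each named crawler avoids depending on how a given crawler combines
sections.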

The bots we gave the "Crawl-delay: 10" instruction were requesting as many as 7 pages
every second, so we asked them to slow down. The number is in seconds, and you can
change it to suit your server capacity, based on their crawling rate. Ten seconds
between page requests is far more leisurely and stops them from asking for more pages
than your server can dish up.
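To put those two rates side by side, here is a small arithmetic sketch. It assumes a
crawler that runs around the clock and, in the second case, fully honors the delay.
Both function names are invented for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def requests_per_day_at_rate(pages_per_second):
    """Daily total for a crawler requesting pages at a steady per-second rate."""
    return pages_per_second * SECONDS_PER_DAY


def max_requests_per_day(crawl_delay_seconds):
    """Upper bound on daily requests when one request is allowed per delay."""
    return SECONDS_PER_DAY // crawl_delay_seconds


if __name__ == "__main__":
    print(requests_per_day_at_rate(7))   # 7 pages every second
    print(max_requests_per_day(10))      # one page every 10 seconds
```

Under those assumptions the difference is 604,800 requests a day versus at most 8,640.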

(You can discover how fast robots and spiders are crawling by looking at your raw
server logs, which show pages requested by precise times to within a hundredth of a
second. They are available from your web host, or ask your web or IT person. If you
have server access, the logs can be found in the root directory, and you can usually
download compressed server log files by calendar day right off your server. You'll
need a utility that can expand compressed files to open and read those plain text
raw server log files.)
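Reading the logs by eye gets tedious, so a short script can do the counting. The
sketch below assumes the Apache "combined" log format with whole-second timestamps;
your host's format may differ, in which case the pattern needs adjusting. The sample
lines, IP addresses and the name examplebot are invented for illustration.

```python
import re
from collections import Counter

# Apache "combined" log format (an assumption, not taken from any real server).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def peak_requests_per_second(lines):
    """Return {user_agent: most requests seen within any single second}."""
    per_second = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        per_second[(match.group("agent"), match.group("time"))] += 1
    peaks = {}
    for (agent, _second), count in per_second.items():
        peaks[agent] = max(peaks.get(agent, 0), count)
    return peaks


# Invented sample data: three hits from one bot inside the same second.
SAMPLE_LOG = [
    '192.0.2.10 - - [23/Sep/2006:22:12:01 -0400] "GET /a.html HTTP/1.1" 200 512 "-" "examplebot/1.0"',
    '192.0.2.10 - - [23/Sep/2006:22:12:01 -0400] "GET /b.html HTTP/1.1" 200 734 "-" "examplebot/1.0"',
    '192.0.2.10 - - [23/Sep/2006:22:12:01 -0400] "GET /c.html HTTP/1.1" 200 298 "-" "examplebot/1.0"',
    '192.0.2.10 - - [23/Sep/2006:22:12:02 -0400] "GET /d.html HTTP/1.1" 200 411 "-" "examplebot/1.0"',
    '198.51.100.7 - - [23/Sep/2006:22:12:01 -0400] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/4.0 (compatible)"',
]

if __name__ == "__main__":
    for agent, peak in sorted(peak_requests_per_second(SAMPLE_LOG).items()):
        print(f"{peak} requests in one second: {agent}")
```

To run it against a real log, pass the open file object instead of SAMPLE_LOG. Any
user agent with a high peak is a candidate for a Crawl-delay or Disallow entry.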

To see the contents of any robots.txt file just type /robots.txt after any domain
name. If the site has that file up, you will see it displayed as a text file in your
web browser. Click on the link below to see that file for Amazon.com:

http://www.Amazon.com/robots.txt

You can see the contents of any website robots.txt file that way.
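The same lookup can be scripted. The sketch below builds the robots.txt address from
any page address on a site and, if asked, downloads the file. The function names are
made up for the example, and fetch_robots needs a live internet connection, so it is
only called when you run the script yourself.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_url(page_url):
    """Return the robots.txt address at the root of the page's domain."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, and replace the path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_robots(page_url, timeout=10):
    """Download the robots.txt text for the site that hosts page_url."""
    with urlopen(robots_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(robots_url("http://www.Amazon.com/some/page.html?x=1"))
```

A site without the file will normally answer with a "not found" error, which urlopen
raises as an exception.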

The robots.txt shown above is what we currently use at Publish101 Web Content
Distributor, just launched in May of 2005. We did an extensive case study and
published a series of articles on crawler behavior and indexing delays known as the
Google Sandbox. That Google Sandbox Case Study is highly instructive on many levels
for webmasters everywhere about the importance of this often ignored little text
file.

One thing we didn't expect to glean from the research into indexing delays (known as
the Google Sandbox) was the importance of robots.txt files to quick and efficient
crawling by the spiders from the major search engines. Nor did we expect the number
of heavy crawls from bots that do no earthly good to the site owner, yet crawl most
sites extensively, straining servers to the breaking point with requests coming as
fast as 7 pages per second.

We discovered in our launch of the new site that Google and Yahoo will crawl the
site whether or not you use a robots.txt file, but MSN seems to REQUIRE it before
they will begin crawling at all. All of the search engine robots seem to request the
file on a regular basis to verify that it hasn't changed.

Then when you DO change it, they will stop crawling for brief periods and repeatedly
ask for that robots.txt file during that time without crawling any additional pages.
(Perhaps they had a list of pages to visit that included the directory or files you
have instructed them to stay out of and must now adjust their crawling schedule to
eliminate those files from their list.)

Most webmasters instruct the bots to stay out of "image" directories and the
"cgi-bin" directory as well as any directories containing private or proprietary
files intended only for users of an intranet or password protected sections of your
site. Clearly, you should direct the bots to stay out of any private areas that you
don't want indexed by the search engines.

The importance of robots.txt is rarely discussed by average webmasters. I've even had
webmasters at some of my client businesses ask me what it is and how to implement it
when I tell them how important it is to both site security and efficient crawling by
the search engines. This should be standard knowledge for webmasters at substantial
companies, and it illustrates how little attention is paid to the use of robots.txt.

The search engine spiders really do want your guidance and this tiny text file is
the best way to provide crawlers and bots a clear signpost to warn off trespassers
and protect private property - and to warmly welcome invited guests, such as the big
three search engines while asking them nicely to stay out of private areas.



Web Site = http://www.cpxclick.com

Contact Details = CPXclick enables web-publishers to realize tangible revenues from
their immediate traffic by turning user clicks into real-time on-the-spot payoffs.
By partnering with the top Advertisers and PPC search engines, CPXclick provides its
affiliates with the highest paying results and hence higher earning potential per
each click. The high-quality network of hundreds of leading web-publishers that
CPXclick boasts enables CPXclick to negotiate revenue sharing agreements whereby its
affiliate Publishers attain the bargaining power they would not be able to achieve
alone.
